Abstract: This paper presents a technique for Informed Source Separation (ISS) of a single-channel mixture, based on the Multiple Input Spectrogram Inversion method. The reconstruction of the source signals is iterative, alternating between a time-frequency consistency enforcement and a re-mixing constraint. A dual-resolution technique is also proposed, for sharper reconstruction of transients. The two algorithms are compared to a state-of-the-art Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with standard source separation objective measures. Experimental results show that the proposed algorithms outperform both this reference technique and the oracle Wiener filter by up to 3 dB in distortion, at the cost of a significantly heavier computation.

Index Terms: Informed source separation, adaptive Wiener filtering, spectrogram inversion, phase reconstruction.



I. Introduction 

Audio source separation has attracted a lot of interest in the last decade, partly due to significant theoretical and algorithmic progress, but also in view of the wide range of applications for multimedia. Whether in video games, web conferencing or active music listening, to name but a few, extraction of the individual sources that compose a mixture is of paramount importance. While blind source separation techniques (e.g. [1]) have made tremendous progress, in the general case they still cannot guarantee a sufficient separation quality for the above-noted applications when the number of sources gets much larger than the number of audio channels (in many cases, only 1 or 2 channels are available). The recent paradigm of Informed Source Separation (ISS) addresses this limitation by providing the separation algorithm with a small amount of extra information about the original sources and the mixing function. This information is chosen at the encoder in order to maximize the quality of separation at the decoder. ISS can then be seen as a combination of source separation and audio coding techniques, taking advantage of both simultaneously. The challenge of ISS is to find the best balance between the final quality of the separated tracks and the amount of extra information, so that it can easily be transmitted alongside the mix, or even watermarked into it.

Techniques such as [2], [3], [4] for stereo mixtures, and [5], [6], also applicable to monophonic mixtures, are all based on the same principle: coding energy information about each source in order to facilitate the posterior separation. Sources are then recovered by adaptive filtering of the mixture. For the sake of clarity, we will assume a monophonic case, with a linear and instantaneous mixing (further extensions are discussed in Section VI-D): $J$ sources $s_j(t)$, $j = 1 \ldots J$, are linearly mixed into the mix signal $m(t) = \sum_j s_j(t)$. If the local time-frequency energy of all sources is known, noted $|S_k(t,f)|^2$, $k = 1 \ldots J$, then the individual source $s_j(t)$ can be estimated from the mix $m(t)$ using a generalized time-frequency Wiener filter in the Short-Time Fourier Transform (STFT) domain. Computing the Wiener filter $\alpha_j$ of source $j$ is equivalent to computing the relative energy contribution of the source with respect to the total energy of the sources. At a given time-frequency bin $(t,f)$, one has:

$$\alpha_j(t,f) = \frac{|S_j(t,f)|^2}{\sum_k |S_k(t,f)|^2}. \qquad (1)$$

The estimated source $\hat{s}_j(t)$ is then computed as the inverse STFT (e.g., with overlap-add techniques) of the weighted signal $\alpha_j(t,f) M(t,f)$, with $M$ the STFT of the mix $m$.
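As an illustration, the Wiener gains and the resulting source estimates can be sketched with NumPy as follows. This is a minimal sketch on random toy data, not the implementation used in the paper: the function names `wiener_gains` and `wiener_estimates`, the array layout (sources, frequency, time) and the small constant `eps` are assumptions made for the example.

```python
import numpy as np

def wiener_gains(spectrograms, eps=1e-12):
    """Return alpha_j = |S_j|^2 / sum_k |S_k|^2 for every source j.

    spectrograms: array of shape (J, F, T) holding the power spectrograms
    |S_j|^2, assumed to be known (oracle setting).
    eps avoids a division by zero in empty bins (assumption of this sketch).
    """
    total = spectrograms.sum(axis=0, keepdims=True)
    return spectrograms / np.maximum(total, eps)

def wiener_estimates(mix_stft, spectrograms):
    """Weight the mixture STFT M by each gain: S_hat_j = alpha_j * M."""
    return wiener_gains(spectrograms) * mix_stft[np.newaxis, ...]

# Toy example: 3 sources on a 4 x 5 time-frequency grid.
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4, 5)) + 1j * rng.normal(size=(3, 4, 5))
M = S.sum(axis=0)                       # linear instantaneous mixture
alpha = wiener_gains(np.abs(S) ** 2)
S_hat = wiener_estimates(M, np.abs(S) ** 2)

print("gains sum to one:", np.allclose(alpha.sum(axis=0), 1.0))
print("estimates sum to the mixture:", np.allclose(S_hat.sum(axis=0), M))
```

Since every gain is real and non-negative, each estimate inherits the phase of the mixture, which is the limitation that the rest of the paper addresses.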

This framework has the advantage that, by construction, the filters $\alpha_j$ sum to unity, and this guarantees that the so-called re-mixing constraint is satisfied:

$$\sum_j \hat{s}_j(t) = m(t). \qquad (2)$$

The main limitation, however, is in the estimation of the phase: only the magnitude $|\hat{S}_j(t,f)|$ of each source is estimated by this adaptive Wiener filter, and the reconstruction uses the phase of the mixture. While this might be a valid approximation for very sparse sources, when two or more sources are active in the same time-frequency bin, this leads to biased estimates, and therefore potentially audible artifacts.

In order to overcome this issue, alternative source separation techniques have been designed [7], [8], taking advantage of the redundancy of the STFT representation. They are based on the classical algorithm of Griffin and Lim (G&L) [9], which iteratively reconstructs the signal knowing only its magnitude STFT. Again, these techniques only use the energy information of each source as prior information, but perform iterative phase reconstruction. For instance, the techniques developed in [7], [8] are shown to outperform the standard Wiener filter. However, in return, reconstructing the phases breaks the re-mixing constraint (2).

The goal of this paper is to propose a new ISS framework, based on a joint estimation of the source signals by an iterative reconstruction of their phase. It is based on a technique called Multiple Input Spectrogram Inversion (MISI) [10], which at each iteration distributes the remixing error $e = m(t) - \sum_j \hat{s}_j(t)$ amongst the estimated sources and therefore enforces the remixing constraint. It should be noted that, within the context of ISS, it uses the same prior information (spectrogram¹ or quantized versions thereof) as the classical Wiener estimate. Therefore, the results of the oracle Wiener estimate will be used as baseline throughout this paper, "oracle" meaning here with perfect (non-quantized) knowledge of the spectrogram of every source.

¹ The word spectrogram is used here to refer to the squared magnitude of the STFT.

In short, the two main contributions of this article can be 
summarized as follows : 

• the modification of the MISI technique to fit within a 
framework of ISS. The original MISI technique [10]
benefits from a high overlap between analysis frames 
(typically 87.5 %), and the spectrograms are assumed to 
be perfectly known. The associated high coding costs are 
not compatible with a realistic ISS application, where the 
amount of side information must be as small as possible. 
We show that a controlled quantization, combined with a 
relaxed distribution of the remixing error, leads to good 
results even at small rates of side information. 

• a dual-resolution technique that adds small analysis win- 
dows at transients, significantly improving the audio 
quality where it is most needed, at the cost of a small - but 
controlled - increase of the amount of side information. 

All these experimental configurations are evaluated for a 
variety of musical pieces, in a context of ISS. 

The paper is organized as follows: a state of the art is given in Section II, where the G&L and MISI techniques are presented. In Section III we propose an improvement to MISI, with preliminary experiments and discussion. In Section IV we address the problem of transients and update our method with a dual-resolution analysis. In Section V the full ISS framework is presented, describing the coding, decoding and reconstruction strategies. Experimental results are presented in Section VI with a discussion on various design parameters. Finally, Section VII concludes this study.



II. State of the art 
A. Signal reconstruction from magnitude spectrogram 

By nature, an STFT computed with an overlap between adjacent windows is a redundant representation. As a consequence, any set of complex numbers $S \in \mathbb{C}^{M \times N}$ does not systematically represent a real signal in the time-frequency (TF) plane. As formalized in [11], the function $\mathcal{G} = \mathrm{STFT}[\mathrm{STFT}^{-1}[\cdot]]$ is not a bijection, but rather a projection of a complex set $S \in \mathbb{C}^{M \times N}$ onto the sub-space of the so-called "consistent" STFTs, which are the TF representations that are invariant through $\mathcal{G}$.

The G&L algorithm [9] is a simple iterative scheme to estimate the phase of the STFT from a magnitude spectrogram $|S|$. At each iteration $k$, the phase of the STFT is updated with the phase of the consistent STFT obtained from the previous iteration, leading to an estimate:

$$\hat{S}^{(k)} = \mathcal{G}\left(|S|\, e^{i \angle \hat{S}^{(k-1)}}\right)$$

It is shown in [9] that each iteration decreases the objective function

$$\sum_{m,n} \left|\, |\hat{S}^{(k)}(m,n)| - |S(m,n)| \,\right|^2$$

However, this algorithm has intrinsic limitations. Firstly, it processes the full signal at each iteration, which prevents an online implementation. This has been addressed in other implementations based on the same paradigm, see e.g. Zhu et al. [12] for online processing and Le Roux et al. [11] for a computational speedup. Secondly, the convergence of the objective function does not guarantee the reconstruction of the original signal, because of phase indetermination. The reader is referred to [13] for a complete review on iterative reconstruction algorithms and their convergence issues.
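The consistency function $\mathcal{G}$ and the G&L iteration described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the helper names (`stft`, `istft`, `consistency`, `griffin_lim`), the periodic Hann window, the 256-sample window with 75% overlap, the random phase initialization and the 8 kHz toy signal are all choices made for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """STFT with a periodic Hann window; x is zero-padded by n_fft on both sides."""
    win = np.hanning(n_fft + 1)[:-1]
    x = np.pad(x, n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape: (frames, bins)

def istft(S, n_fft=256, hop=64):
    """Least-squares overlap-add inverse of `stft` (same window and hop)."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(S, n=n_fft, axis=1) * win
    x = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return (x / np.maximum(norm, 1e-8))[n_fft:-n_fft]   # drop the padding

def consistency(S, n_fft=256, hop=64):
    """G = STFT[STFT^-1[.]]: map any complex array to a consistent STFT."""
    return stft(istft(S, n_fft, hop), n_fft, hop)

def griffin_lim(mag, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate an STFT whose magnitude is close to `mag`, from random phases."""
    rng = np.random.default_rng(seed)
    S_hat = mag * np.exp(2j * np.pi * rng.random(mag.shape))
    errors = []
    for _ in range(n_iter):
        # Keep the target magnitude, reuse the phase of the last consistent STFT.
        S_hat = consistency(mag * np.exp(1j * np.angle(S_hat)), n_fft, hop)
        errors.append(float(np.sum((np.abs(S_hat) - mag) ** 2)))
    return S_hat, errors

# Toy signal: two sinusoids, 4096 samples at an assumed 8 kHz sampling rate.
t = np.arange(4096) / 8000.0
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
mag = np.abs(stft(x))
S_rec, errors = griffin_lim(mag, n_iter=30)
print("objective after first and last iteration:", errors[0], errors[-1])
```

The STFT of an actual signal is left unchanged by `consistency`, whereas an arbitrary complex array is not, which is what the iteration exploits.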



B. Re-mixing constraint and MISI 

In an effort to improve the convergence of the reconstruction within a source separation context, Gunawan et al. [10] proposed the MISI technique, which extracts additional phase information from the mixture. Here, the estimated sources should not only be consistent in terms of time-frequency (TF) representation, they should also satisfy the re-mixing constraint, so that the re-mixing of the estimated sources is close enough to the original mixture. Let us consider the time-frequency remixing error $E_m$ so that:

$$E_m = M - \sum_j \hat{S}_j^{(k)} \qquad (4)$$

Note that $E_m = 0$ when using the Wiener filter. In the case of an iterative G&L phase reconstruction, $E_m \neq 0$ at any iteration. Here, MISI distributes the error equally amongst the sources, leading to the corrected source at iteration $k$, $C_j^{(k)}$:

$$C_j^{(k)} = \hat{S}_j^{(k)} + \frac{E_m}{J} \qquad (5)$$

where $J$ is the number of sources.

Therefore, if the spectrogram of the source is perfectly known, it only consists in adapting the G&L technique with an additional phase update based on the re-mixing error:

$$\hat{S}_j^{(k+1)} = \mathcal{G}\left(|S_j|\, e^{i \angle C_j^{(k)}}\right) \qquad (6)$$

and the MISI algorithm alternates steps (4), (5) and (6). It should be emphasized that, with MISI, the time-domain estimated sources do not satisfy the remixing constraint (equation (2)), step (4) playing a role only in the estimation of the phase.
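The alternation between error computation, equal error distribution and phase update can be sketched as follows. This is a minimal NumPy sketch and not the reference MISI implementation: the helper names, the window parameters, the initialization of every source with the mixture phase and the two-sinusoid toy mixture are assumptions made for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """STFT with a periodic Hann window; x is zero-padded by n_fft on both sides."""
    win = np.hanning(n_fft + 1)[:-1]
    x = np.pad(x, n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=256, hop=64):
    """Least-squares overlap-add inverse of `stft` (same window and hop)."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(S, n=n_fft, axis=1) * win
    x = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(frames):
        x[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    return (x / np.maximum(norm, 1e-8))[n_fft:-n_fft]

def consistency(S, n_fft=256, hop=64):
    """G = STFT[STFT^-1[.]]."""
    return stft(istft(S, n_fft, hop), n_fft, hop)

def misi(mix, mags, n_iter=50, n_fft=256, hop=64):
    """Jointly estimate J sources from the mixture and their known magnitudes.

    mags: array (J, frames, bins) of magnitude STFTs |S_j| (assumed known).
    """
    M = stft(mix, n_fft, hop)
    J = mags.shape[0]
    S_hat = mags * np.exp(1j * np.angle(M))      # init with the mixture phase (assumption)
    for _ in range(n_iter):
        E = M - S_hat.sum(axis=0)                # remixing error
        C = S_hat + E / J                        # equal distribution of the error
        # Phase update: known magnitude, phase of the corrected source, then G.
        S_hat = np.stack([consistency(m * np.exp(1j * np.angle(c)), n_fft, hop)
                          for m, c in zip(mags, C)])
    return np.stack([istft(s, n_fft, hop) for s in S_hat])

# Toy mixture of two sinusoids (assumed 8 kHz sampling rate, 4096 samples).
t = np.arange(4096) / 8000.0
sources = np.stack([np.sin(2 * np.pi * 440 * t), 0.5 * np.sin(2 * np.pi * 1000 * t)])
mix = sources.sum(axis=0)
mags = np.stack([np.abs(stft(s)) for s in sources])
estimates = misi(mix, mags, n_iter=20)
```

Note that the time-domain outputs are obtained from the last consistent STFTs, so their sum is not forced to equal the mixture.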

III. Enhancing the iterative reconstruction 

The MISI technique [10] presented in the previous section
assumes that the spectrogram of every source is perfectly 
known. However, in the framework of ISS, we have to transmit 
the spectrogram information of each source with a data rate 
that is as small as possible, i.e. with quantization. At low bit 
rates (coarse quantization), the spectrograms may be degraded 
up to the point that modulus reconstruction is necessary. 
Therefore we will not only perform a phase reconstruction 
as in MISI, but a full TF reconstruction (phase and modulus) 
from the knowledge of both the mixture and the degraded 
spectrogram. 



Fig. 2. MISI separation results on the test signal, for different spectrogram quantization levels (no quantization, 2 dB and 4 dB quantization; panel shown: 50 % overlap). Scores are relative to the oracle Wiener filter, and error bars indicate standard deviations.



A. Activity-based error distribution 

It is here assumed that only a degraded version of the source spectrogram is given. Equation (5) can still be used to rebuild both magnitude and phase of the STFT. However, a direct application of this technique leads to severe crosstalk, as some re-mixing error gets distributed on sources that are silent.

In order to only distribute the error where needed, we define a TF domain where a source is considered active based on its normalized contribution $\alpha_j$, as given by the Wiener estimate in eqn. (1). For the source $j$, the activity domain $\Theta_j$ (equation (7)) is the binary TF indicator of the bins where the normalized contribution $\alpha_j$ of the source is above some activity threshold $p$:

$$\Theta_j(n,m) = \begin{cases} 1 & \text{if } \alpha_j(n,m) > p \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

Now, the error is distributed only where sources are active:

$$\hat{S}_j^{(k)}(n,m) = \mathcal{G}\left(\hat{S}_j^{(k-1)}(n,m) + \Theta_j(n,m)\, \frac{E_m(n,m)}{D(n,m)}\right) \qquad (8)$$

where $D(n,m)$ is a TF error distribution parameter. It is possible to compute $D(n,m)$ as the number $N_a$ of active sources at TF bin $(n,m)$ (i.e., $D(n,m) = \sum_j \Theta_j(n,m)$). However, it was noticed experimentally that a fixed $D$ such that $D \gg N_a$ provides better results. This means that only a small portion of the error is added at each iteration, and that the successive TF consistency constraint enforcements (the $\mathcal{G}$ function) validate or invalidate the added information. The exact tuning of parameters $D$ and $p$ is based on experiments, as discussed in Section III-B. We expect that the lower $p$, the fewer the reconstruction artifacts, but also the higher the crosstalk (source interference), because the remixing error is distributed over a larger number of bins.
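The activity domain and the masked error distribution can be illustrated on toy data as follows. This is a minimal sketch under stated assumptions: the function names `activity_domain` and `add_masked_error` are hypothetical, the data are random, and the consistency function $\mathcal{G}$ is deliberately left out so that only the error distribution step is shown.

```python
import numpy as np

def activity_domain(alpha, p=0.01):
    """Binary indicator: 1 where the Wiener gain alpha_j exceeds the threshold p."""
    return (alpha > p).astype(float)

def add_masked_error(S_hat, M, theta, D=40.0):
    """Return S_hat_j + theta_j * E_m / D for every source j.

    D may be a scalar (fixed D) or an array of shape (F, T), e.g. the number
    of active sources N_a. The consistency projection G is not applied here.
    """
    E = M - S_hat.sum(axis=0)            # remixing error E_m
    return S_hat + theta * E / D

# Toy data: 3 sources on a 4 x 5 grid; source 2 is silent on the first 2 rows.
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4, 5)) + 1j * rng.normal(size=(3, 4, 5))
S[2, :2, :] = 0.0
M = S.sum(axis=0)
power = np.abs(S) ** 2
alpha = power / power.sum(axis=0)        # Wiener gains
theta = activity_domain(alpha, p=0.01)
N_a = theta.sum(axis=0)                  # number of active sources per bin

S_hat = np.abs(S) * np.exp(1j * np.angle(M))    # known magnitude, mixture phase
step_fixed = add_masked_error(S_hat, M, theta, D=40.0)
step_na = add_masked_error(S_hat, M, theta, D=np.maximum(N_a, 1.0))
```

With $D = N_a$ the whole remixing error is cancelled in one step, whereas a fixed $D = 40$ only removes the fraction $N_a / 40$ of it, leaving the subsequent consistency enforcement to accept or reject the added energy. In both cases the silent source receives no error on the bins where it is inactive.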

B. Preliminary experiments 

A first test is performed to validate the proposed design, and to experiment on the various parameters. We use a monophonic music mixture of electro-jazz in a 16-bit/44.1 kHz format. Five instruments are playing in this mixture: a bass, a drum set, percussion, an electric piano and a saxophone. These instruments present characteristics that interfere with one another. For instance, the bass guitar and the electric piano interfere heavily at low frequencies, whereas drums and percussion both have strong transients. The saxophone is very breathy, but the breath contribution is far below the energy of the harmonics.

The spectrograms are log-quantized (in dB, cf. [14], (6)) with three quantization steps: $u = 0$ (no quantization), 2 and 4 dB. For each of these three conditions, we use two overlap values of 50% and 75% and a window size of 2048 samples at a 44.1 kHz sampling rate. Two values of the activity threshold are tested: $p = 0.1$ and $0.01$. The phase of each source is initialized with the phase of the mixture, and 50 iterations are performed.
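For illustration, a uniform quantization of a power spectrogram on a dB scale can be sketched as follows. The exact quantizer of [14] is not reproduced here: the function names, the rounding rule and the numerical floor are assumptions of this sketch, and the case $u = 0$ (no quantization) simply corresponds to skipping this step.

```python
import numpy as np

def quantize_db(power_spec, step_db, floor=1e-12):
    """Return integer indices of a uniform quantizer on the dB scale.

    floor is a small constant avoiding log(0) (assumption of this sketch).
    """
    level_db = 10.0 * np.log10(np.maximum(power_spec, floor))
    return np.round(level_db / step_db).astype(int)

def dequantize_db(index, step_db):
    """Map quantization indices back to power values."""
    return 10.0 ** (index * step_db / 10.0)

# Toy power spectrogram and the two non-zero steps used in the experiment.
rng = np.random.default_rng(0)
spec = rng.exponential(size=(8, 8))
for u in (2.0, 4.0):
    rec = dequantize_db(quantize_db(spec, u), u)
    err_db = np.abs(10.0 * np.log10(rec / spec))
    print(f"u = {u:.0f} dB: largest reconstruction error {err_db.max():.2f} dB")
```

With such a rounding quantizer, the error on each bin is bounded by half the quantization step, i.e. 1 dB for $u = 2$ dB and 2 dB for $u = 4$ dB.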

We test 3 variants of the proposed separation method:

1) M1: with $D = 40$ and activity detection.

2) M2: with $D = N_a$ and activity detection.

3) M3: with $D = N_a$ and no activity detection.

For this evaluation, we use the three objective criteria of the BSS Eval toolbox [15], namely the Source to Distortion Ratio (SDR), the Source to Interference Ratio (SIR) and the Source to Artifact Ratio (SAR). Results given in Figure 1 are relative to the performance of the oracle Wiener filter, taken as reference. In the present experiment the absolute mean (respectively, standard deviation) of the oracle Wiener filter scores were: SDR = 9.0 (1.3) dB, SIR = 21 (5.1) dB, SAR = 9.4 (1.2) dB for both 50% and 75% overlap. Results of MISI on the same signal are given in Figure 2.

C. Discussion 

The results are presented in Figures 1 and 2 and the reconstructed sources are available on the demo webpage [16]. The performance of unquantized MISI is very high, but decreases rapidly when quantization increases. This is directly linked to the fact that the spectrogram is constrained, which would be even more problematic when part of this spectrogram is missing, for bitrate reduction purposes. The activity-based error distribution (M1 and M2 vs M3) significantly improves the three objective criteria, both in mean and standard deviation. This is expected, as the activity domain prevents reconstruction of a source on a bin where its contribution to the mixture is negligible. One can also see that lowering the activity threshold $p$ (from 0.1, upper line, to 0.01, lower line) improves the SAR but lowers the SIR: a lower value of $p$ distributes the error over a larger number of bins. While this leaves fewer "holes" in the reconstructed TF representation (higher SAR), it also involves more crosstalk between sources (lower SIR). In every condition, the tradeoff between SIR and SAR when lowering $p$ seems to be a loss of about 1 dB on the SIR for a gain of 1 dB on the SAR. Since the SIR is already high on the oracle Wiener filter (above 15 dB), it seems a better tradeoff to favor SAR, in order to improve the global SDR gain. Therefore, the lower value $p = 0.01$ will be used for the rest of the paper.

The improvements brought by $D \gg N_a$ (M1) compared to $D = N_a$ (M2) are less important. The precise choice of $D$ is explored in Fig. 3. Large values of $D$ seem to provide a better convergence: the energy of the error that is distributed to a source but that does not belong to it (on a consistency basis) will be easily discarded because of its small value and because of the energy smearing effect of the $\mathcal{G}$ function.

When the spectrogram is quantized with a $u = 4$ dB quantization step, the reconstruction performance reaches a maximum with $D = 40$ for 50 iterations.

Fig. 1. Separation results for the three variants of the proposed method: M1 ($D = 40$, activity detection), M2 ($D = N_a$, activity detection) and M3 ($D = N_a$, no activity detection). Scores are relative to the oracle Wiener filter, and error bars indicate standard deviations. Different parameters are tested: the quantization step $u$, the STFT overlap (50% and 75%) and the activity threshold $p$.

Fig. 3. Different values of $D$ for different numbers of iterations.

Fig. 4. Improvements over the Wiener filter for a varying quantization step $u$. Window size of 2048 samples with 50% overlap, $p = 0.01$.



Finally, the effect of spectrogram quantization is clear. As expected, increasing the quantization step lowers the SDR but also dramatically lowers the SAR because of added artifacts caused by the quantization. Figure 4 presents the SDR improvement when varying the quantization step $u$, for algorithm M1. Even for a relatively high quantization step of 4 dB, results still outperform the oracle Wiener filter.

To summarize the results of this preliminary experiment, we have shown that, at least for the sounds under test, the proposed method M1 (activity detection, $D = 40$) can outperform the oracle Wiener filter, while keeping the amount of side information low, with a crude quantization of the spectrograms ($u = 4$ dB). However, these results are not perfect, especially in terms of perception. When listening to the sound examples (available online [16]), one can hear a number of artifacts, especially at transients. Indeed, transient reconstruction from a spectrogram or from a Wiener filter is a well-known issue [17], as time domain localization is mainly transmitted by the phase. The next section alleviates this problem by using multiple analysis windows.

IV. Improving transients reconstruction 

The missing phase information at transients leads to a smearing of the energy, pre-echo or an impression of over-smoothness of the attack. In order to prevent these issues, window switching can be used, with shorter STFT windows at transients [18], [19], [20]. In Advanced Audio Coding (AAC) for instance, the window switches from 2048 to 256 samples when a transient is detected. Here, because we want the same TF grid for sources that can have very different TF resolution requirements, we do not switch between window sizes but rather use a dual resolution at transients, keeping both window sizes. Note that this leads to a small overhead in terms of amount of side information to encode (both short- and long-window spectrograms have to be quantized and transmitted at transients), but does not require transition windows.

A. Transients detection 

We use the same non-uniform STFT grid for every source and for the mixture, keeping the ability of TF addition and subtraction for error distribution. In order to obtain this non-uniform grid, we proceed in three steps at the coding stage:

1) A binary transient indicator $T_j(t)$ is computed for each source $j$, using the Complex Spectrum Difference. $T_j(t)$ equals 1 if a transient is detected at time $t$, 0 otherwise.

2) The transients are combined in $T_{all}$ so that

$T_{all} = T_1 \oplus T_2 \oplus T_3 \oplus \ldots$

where $\oplus$ is the logical OR function.

3) $T_{all}$ is cleaned so that the time between two consecutive transients is greater than or equal to the length of two large windows.

The non-uniform STFT is therefore constructed by concatenation of the large-window STFT on all frames, plus the short-window STFT on transient frames in $T_{all}$. Figure 5 shows this dual-resolution STFT when a transient is detected.

Fig. 5. Large and small windows in the dual-resolution STFT.
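Steps 2 and 3 above can be sketched as follows, taking the per-source indicators of step 1 as given. This is a minimal sketch: the function name `combine_transients`, the expression of the minimum spacing `min_gap` in time steps, and the rule of keeping the earliest transient of a close group are assumptions, since the paper does not specify which transient is kept.

```python
import numpy as np

def combine_transients(indicators, min_gap):
    """Combine per-source transient flags and enforce a minimum spacing.

    indicators: (J, T) binary array, one row T_j(t) per source.
    min_gap: minimum distance between two kept transients, in time steps
             (it would correspond to the length of two large windows).
    Keeping the earliest transient of a close group is an assumption.
    """
    t_all = np.any(indicators.astype(bool), axis=0)    # logical OR over sources
    cleaned = np.zeros_like(t_all)
    last = None
    for t in np.flatnonzero(t_all):
        if last is None or t - last >= min_gap:
            cleaned[t] = True
            last = t
    return cleaned

# Toy example: two sources, eight time steps, minimum spacing of four steps.
indicators = np.array([[0, 1, 0, 0, 0, 0, 1, 0],
                       [0, 0, 1, 0, 0, 0, 0, 0]])
t_clean = combine_transients(indicators, min_gap=4)
print(t_clean.astype(int))
```

In this toy example the transient at step 2 is dropped because it follows the one at step 1 too closely, while the one at step 6 is kept.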

B. Experiments 

In order to evaluate the improvements brought by dual resolution, we use the same sound samples as before: an electro-jazz piece of 15 seconds of music composed of 5 sources. The same parameters are also used: 50 iterations, $D = 40$, $p = 0.01$, and two overlap values: 50% and 75%. The large and small window sizes are set to 2048 and 256 samples, respectively.

Results are presented in Figure 6, showing improvement over the Wiener filter as before. Note that we used the same Wiener filter reference (single-resolution) throughout this experiment. The results obtained with transient detection at 50% overlap (leading to an increase in data size of 15 to 25%, depending on the number of detected transients) are close to the results obtained with a uniform STFT at 75% overlap (100% more data): transient detection brings the same separation benefits as increasing the overlap, with the added value of sharper transients. Audio examples are available on the demo web page [16].

V. Practical implementation in an ISS framework 

This section presents the new source reconstruction method 
in a full ISS framework. We call our method Informed Source 
Separation using Iterative Reconstruction (ISSIR). First the 
coding scheme will be presented, together with parameter 
tuning. Then, the decoding scheme will be presented. 

A. Coder 

Data coding is used to format and compact the information 
needed for the posterior reconstruction. The size of this coded 
data is of prime importance:

• In the case of watermarking within the mixture (which
would then be coded in PCM), high capacity watermark- 
ing may be available [21], limited by a constraint of 
perceptual near-transparency. The lower the bit rate, the 
higher the quality of the final watermarked mixture, used 
for the source reconstruction. 

• In the case of a compressed file format for the mixture, 
the side-information could be embedded as meta-data 
(AAC allows meta-data chunks, for instance). In this case, 
the size of the data is also important in order to keep the 
difference between the coded audio file and the original 
audio file to a minimum. 

Of course, increasing the bit rate would eventually lead to 
the particular case where simple perceptual coding of all the 
sources (for instance with MPEG 2/4 AAC) would be more 
efficient than informed separation. 

In order to achieve optimal data compaction, we make the following observation: most music signals are sparse and mostly described by their most energetic bins. Therefore, spectrogram coding should not require the description of TF bins whose energy is more than, e.g., 20 dB below the maximum energy bin of the TF representation. What we propose is then to discard the bins whose energy is lower than a threshold $T$. $T$ is the first parameter to be adjusted in order to fit the target bit rate, with $T$ below -20 dB. Note that former work, e.g. [6], also thresholds the spectrogram, but much lower in energy (-80 dB). The second parameter for data compaction is the quantization of the spectrogram with step $u$. As seen before, increasing $u$ decreases the reconstruction quality but lowers the number of energy levels to be encoded. Since increasing $u$ did not change the entropy of the data distribution much, we choose $u = 1$ dB for the whole experiment. The third parameter $p$, used for the activity domain, is set to 0.01 and is not modified in our experiments.

The data size of the activity domain is then fixed throughout the experiments. In order to compact this information even more, we group time-frequency bins on the frequency scale using logarithmic rules similar to the Equivalent Rectangular Bandwidth (ERB [22]) scale. This psychoacoustic-based compression technique has also been used in informed source
separation in 21, 0- For the experiments in this paper we 
use 75, 125 or 250 non-overlapping bands on large windows (1025 coded bins) and 25 bands on small windows (129 coded bins), as presented in Figure 7.

Additional parameters such as spectrogram normalization coefficients, STFT structure, transient location and quantization step are transmitted separately: such information represents a negligible amount of data as most of it is fixed for the whole file duration. At the end of the coding stage, a basic entropy coding (in our experimental setup, we used bzip2) is added.

Figure 8 shows the coding scheme, with the feedback loop for the adjustment of the model parameters to the target bit rate in kb/source/s. The target bit rate is a mean amongst the sources, as some sources will require more information to be encoded than others. Such a framework allows mean data rates as low as 2 kb/source/s.

Fig. 8. Block diagram of the ISSIR coding framework.
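The thresholding, quantization and entropy coding steps of the coder can be sketched as follows, using the `bz2` module of the Python standard library. This is a minimal sketch, not the ISSIR coder: the band grouping and the feedback loop are omitted, and the function names, the sentinel index used for discarded bins, the 16-bit index format and the toy data are assumptions made for the example.

```python
import bz2
import numpy as np

def encode_spectrogram(power_spec, threshold_db=-20.0, step_db=1.0):
    """Threshold, quantize (dB scale) and bzip2-compress one power spectrogram.

    Bins lying below threshold_db relative to the strongest bin are discarded
    and replaced by a single sentinel index (an assumption of this sketch).
    """
    level_db = 10.0 * np.log10(np.maximum(power_spec, 1e-30))
    level_db -= level_db.max()                        # 0 dB = most energetic bin
    index = np.round(level_db / step_db)
    sentinel = np.floor(threshold_db / step_db) - 1   # below every kept index
    index = np.where(level_db >= threshold_db, index, sentinel).astype(np.int16)
    return bz2.compress(index.tobytes()), index.shape

def decode_spectrogram(payload, shape, threshold_db=-20.0, step_db=1.0):
    """Inverse of `encode_spectrogram`; discarded bins are returned as zeros.

    The result is relative to the strongest bin: the normalization
    coefficient is assumed to be transmitted separately.
    """
    index = np.frombuffer(bz2.decompress(payload), dtype=np.int16).reshape(shape)
    sentinel = np.floor(threshold_db / step_db) - 1
    power = 10.0 ** (index * step_db / 10.0)
    return np.where(index == sentinel, 0.0, power)

# Toy "sparse" power spectrogram and resulting rate for an assumed duration.
rng = np.random.default_rng(0)
spec = rng.exponential(size=(64, 32)) ** 4
payload, shape = encode_spectrogram(spec)
decoded = decode_spectrogram(payload, shape)
duration_s = 1.0                                      # assumed excerpt duration
rate_kbps = 8 * len(payload) / duration_s / 1000.0
print(f"{len(payload)} bytes, i.e. {rate_kbps:.2f} kb/s for this toy source")
```

In a rate-controlled coder, `threshold_db` would be the parameter adjusted by the feedback loop until the measured rate meets the target.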



B. Decoder

The decoder performs all the previous operations backwards. It first initializes each source using the log-quantized data and the phase of the mixture $M$. Then, the iterative reconstruction is run for $K$ iterations and the signals are finally reconstructed using the decoded activity domain $\Theta_j$.

VI. Experiments

In this section we validate our complete ISSIR framework on different types of monophonic mixtures. As the problem of informed source separation is essentially a tradeoff between bit rate and quality, we perform the experiments by setting different thresholds $T$ and filter bank sizes for the single and dual window STFT algorithms presented before. The baseline for comparison is a state-of-the-art ISS framework based on Wiener filtering [6], where JPEG image coding is simply used to encode the spectrograms. For a fair comparison, we also use this method with the same ERB-based filter bank grouping. For reference, we also compute the results of the original MISI method, with spectrogram quantization and coding.

The test database is composed of 14 short monophonic mixtures from the Quaero database², from 15 to 40 s long, with various musical styles (pop, rock, industrial rock, electro jazz, disco) and different instruments. Each mixture is composed of 5 to 10 different sources, for a total of 90 source signals. The relation between the sources and the mixture is linear, instantaneous and stationary; however, the sources include various effects such as dynamic processing, reverberation or equalization, so that the resulting mixtures are close to what would have been obtained by a sound engineer on a Digital Audio Workstation.

Figure 9 presents the mean and standard deviation of the improvements over the oracle Wiener filter for the whole database. As before, SDR, SIR and SAR are used for the comparison of the different methods. Reported bit rates are averaged over the whole database, at a given experimental condition. Four mixtures under Creative Commons license are given as audio examples on the demo web page [16]:

• Arbaa (Electro Jazz) - mixture nb. 2 - 5 sources

• Farkaa (Reggae) - mixture nb. 4 - 7 sources

• Nine Inch Nails (Industrial Rock) - mixture nb. 8 - 7 sources

• Shannon Hurley (Pop) - mixture nb. 12 - 8 sources


A. Bit rates and overall quality 

As expected, increasing the bit rate improves the reconstruction on all criteria. The two ISSIR algorithms always outperform the baseline method of [6], although not significantly at very low bit rates when the non-uniform filterbank is used. The dual-resolution framework requires more data, and only outperforms the single-resolution algorithm for bit rates higher than 10 kb/source/s, where the latter tends to reach its maximum of 1.7 dB improvement over the oracle Wiener filter. At 32 kb/source/s, the dual-resolution method reaches its own maximum of approx. 3 dB improvement over the oracle Wiener filter. For even higher bit rates, MISI gives significantly better results, but the high amount of total side information is not compatible with a realistic ISS usage.

² www.quaero.org

Fig. 9. Reconstruction results for the different methods, on monophonic mixtures at different bit rates. Results are given relative to the oracle Wiener filter.

B. Performance as a function of the sound file

The previous experiments are associated with a strong variance: results are highly dependent both on the type of music and on the sources. Figure 10 presents the SDR results for the 14 sound files, at an average bit rate of 10 kb/source/s. It can be observed that the variations occur both from mixture to mixture and within a mixture. At this bit rate, the dual-resolution algorithm may not always perform better than the single-resolution algorithm, as can be seen for mixtures 3, 5, 13, and 14. However, the proposed technique (single or dual) always outperforms the reference method of [6].

C. Computation time

Since the proposed reconstruction algorithm is iterative, the decoding requires a heavier computation load than simple Wiener estimates. A Matlab implementation of the dual-resolution scheme led to computation times of 6 to 9 s per second of signal, for 50 iterations, on a standard computer.

As a proof of concept, the single-resolution iterative reconstruction was also implemented in parallel with the OpenCL [23] API, using a fast iterative signal reconstruction [11]. On a medium-range graphics card, the computation time dropped to 0.3 to 0.4 s per second of signal. The adaptation of this fast scheme to the dual-resolution case is, however, not straightforward.

D. Complex mixtures 

In the case of complex mixtures (multichannel, convolutive, etc.), the main issue is the error distribution as in equation (8), which itself requires a partial inversion of the mixing function. In fact, actual source separation is done at this level, and this paper shows that a simple binary mask at this stage is sufficient in order to achieve good results on monophonic mixtures. The framework presented in this paper could then be adapted to a vast variety of source separation methods, especially in the cases when the mixing function is known. In the case of multichannel mixtures, for instance, the error distribution could be done using beamforming techniques.

VII. Conclusion 

This paper proposes a complete framework for informed source separation using an iterative reconstruction, called Informed Source Separation using Iterative Reconstruction (ISSIR). In experiments on various types of music, ISSIR outperforms, on standard objective criteria, a state-of-the-art ISS technique based on JPEG compression of the spectrogram, and even the oracle Wiener filtering, by up to 3 dB in source-to-distortion ratio.

Future work should focus on the optimization of the algorithm in order to lighten the computation load, and on its extension to multichannel and convolutive mixtures. Psychoacoustic models should also be considered as a way to compact and shape the side information. Finally, formal listening tests should confirm the objective results, although it should be emphasized that setting up a whole methodology for such ISS listening tests (which is not as established as in other fields, e.g., audio coding) is a work in itself that goes beyond the current study.